Transparent Fault Tolerance for Grid Applications

Authors

  • Pawel Garbacki
  • Bartosz Biskupski
  • Henri E. Bal
Abstract

A major challenge facing grid applications is the appropriate handling of failures. In this paper we address the problem of making parallel Java applications based on Remote Method Invocation (RMI) fault tolerant in a way that is transparent to the programmer. We use globally consistent checkpointing to avoid having to restart long-running computations from scratch after a system crash. The application’s execution state can be captured at any time, even when some of the application’s threads are blocked waiting for the result of a (nested) remote method call. We modify only the program’s bytecode, which makes our solution independent of any particular Java Virtual Machine (JVM) implementation. The bytecode transformation algorithm performs a compile-time analysis to reduce the number of modifications to the application’s code, which directly affects the application’s performance. The fault-tolerance extensions also cover RMI components such as the RMI registry. Because essential data such as checkpoints are replicated, our system is resilient to simultaneous failures of multiple machines. Experimental results show that our fault-tolerance extensions incur negligible performance overhead.

Similar resources

Checkpointing and Migration of Communication Channels in Heterogeneous Grid Environments

A grid checkpointing service providing migration and transparent fault tolerance is important for distributed and parallel applications executed in heterogeneous grids. In this paper we address the challenges of checkpointing and migrating communication channels of grid applications executed on nodes equipped with different checkpointer packages. We present a solution that is transparent for th...

Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing that periodically ...

PastryGridCP: A Decentralized Rollback-Recovery Protocol for Desktop Grid Systems

Desktop Grids are composed of several thousands of resources. They are characterized by high volatility of resources, due to voluntary disconnections or failures. This could affect the proper termination of applications execution. PastryGrid is a decentralized system which manages desktop grid resources and user applications over a fully decentralized P2P network. In this paper we present Pastr...

ProActive: an Integrated platform for programming and running applications on grids and P2P systems

We propose a grid programming approach using the ProActive middleware. The proposed strategy addresses several grid concerns, which we have classified into three categories. I. Grid Infrastructure which handles the resource acquisition and creation using deployment descriptors and Peer-to-Peer. II. Grid Technical Services which can provide non-functional transparent services like: fault toleran...

MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes

Fault tolerance is an essential element of distributed systems that require a reliable computation environment. In spite of extensive research over two decades, practical fault-tolerance systems have not been provided. This is due to the high overhead and the unwieldiness of previous fault-tolerance systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system f...

Publication date: 2005